354 research outputs found

Inferring ancestral sequences in taxon-rich phylogenies

Author: Gascuel Olivier
Steel Mike
Publication date: 01/01/2010

Statistical consistency in phylogenetics has traditionally referred to the accuracy of estimating phylogenetic parameters for a fixed number of species as we increase the number of characters. However, as sequences are often of fixed length (e.g. for a gene) although we are often able to sample more taxa, it is useful to consider a dual type of statistical consistency where we increase the number of species, rather than characters. This raises some basic questions: what can we learn about the evolutionary process as we increase the number of species? In particular, does having more species allow us to infer the ancestral state of characters accurately? This question is particularly relevant when sequence site evolution varies in a complex way from character to character, as well as for reconstructing ancestral sequences. In this paper, we assemble a collection of results to analyse various approaches for inferring ancestral information with increasing accuracy as the number of taxa increases.
Comment: 32 pages, 5 figures, 1 table

arXiv.org e-Print Archive

CiteSeerX

Mathematical and Computational Evolutionary Biology (2013)

Author: Gascuel Olivier
Stadler Tanja
Publication date: 02/08/2017

RERO DOC Digital Library

Fast NJ-like algorithms to deal with incomplete distance matrices

Author: Criscuolo Alexis
Gascuel Olivier
Publication date: 01/03/2008

RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license, which is similar to the 'Creative Commons Attribution Licence'. In brief, you may copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.

Abstract

Background: Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the resulting distance matrix. These methods are fast and fairly accurate. However, they cope poorly with incomplete distance matrices. Such matrices are frequent in recent multi-gene studies, when two species do not share any gene in the analyzed data. The few existing algorithms that infer trees with satisfactory accuracy from incomplete distance matrices have time complexity in O(n^4) or more, where n is the number of taxa, which precludes large-scale studies. Agglomerative distance algorithms (e.g. NJ [1,2]) are much faster, with time complexity in O(n^3), which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two newly created branches, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices.

Results: We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ [3] and MVR [4]. Our adaptation generalizes to incomplete matrices the taxon pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE [5]. Steps (b) and (c) are also modified, but O(n^3) time complexity is kept. The performance of these new algorithms is studied with large-scale simulations, which mimic multi-gene phylogenomic datasets. Our new algorithms, named NJ*, BIONJ* and MVR*, infer phylogenetic trees that are at least as accurate as those inferred by other available methods, but with much faster running times. MVR* presents the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited for multi-gene studies where some distances are accurately estimated using numerous genes, whereas others are poorly estimated (or not estimated) due to the low number (absence) of sequenced genes being shared by both species.

Conclusion: Our distance-based agglomerative algorithms NJ*, BIONJ* and MVR* are fast and accurate, and should be quite useful for large-scale phylogenomic studies. When combined with the SDM method [6] to estimate a distance matrix from multiple genes, they offer a relevant alternative to usual supertree techniques [7]. Binaries and all simulated data are downloadable from [8]. Published versio
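The three agglomerative steps (a) to (c) named in the abstract can be sketched for the classical case of a complete distance matrix. This is a minimal illustration of standard Neighbor-Joining only, not the NJ*, BIONJ* or MVR* adaptations to incomplete matrices; the function names, the dict-of-dicts matrix layout and the four-taxon example are assumptions made here for illustration.

```python
# Sketch of one classical NJ iteration on a COMPLETE distance matrix.
# Not the NJ*/BIONJ*/MVR* adaptation; all names below are hypothetical.

def nj_select_pair(d, active):
    """Step (a): pick the pair minimising (r-2)*d_ij - sum_i - sum_j."""
    r = len(active)
    total = {i: sum(d[i][k] for k in active if k != i) for i in active}
    best, best_q = None, float("inf")
    for idx, i in enumerate(active):
        for j in active[idx + 1:]:
            q = (r - 2) * d[i][j] - total[i] - total[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best

def nj_branch_lengths(d, active, i, j):
    """Step (b): lengths of the two branches joining i and j to the new node."""
    r = len(active)
    total_i = sum(d[i][k] for k in active if k != i)
    total_j = sum(d[j][k] for k in active if k != j)
    length_i = 0.5 * d[i][j] + (total_i - total_j) / (2 * (r - 2))
    return length_i, d[i][j] - length_i

def nj_reduce(d, active, i, j, new):
    """Step (c): replace i and j by the new node and update distances."""
    d[new] = {}
    for k in active:
        if k not in (i, j):
            d[new][k] = d[k][new] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
    return [k for k in active if k not in (i, j)] + [new]

# Additive distances for the tree ((A:1,B:2):1,(C:1,D:3)).
pairs = {("A", "B"): 3, ("A", "C"): 3, ("A", "D"): 5,
         ("B", "C"): 4, ("B", "D"): 6, ("C", "D"): 4}
d = {t: {} for t in "ABCD"}
for (x, y), value in pairs.items():
    d[x][y] = d[y][x] = value

active = list("ABCD")
i, j = nj_select_pair(d, active)
lengths = nj_branch_lengths(d, active, i, j)
active = nj_reduce(d, active, i, j, "U")
```

On this additive example the first iteration joins A and B and recovers their branch lengths; the loop would then repeat from step (a) on the reduced matrix until three nodes remain.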

Springer - Publisher Connector

Directory of Open Access Journals

The combinatorics of overlapping genes

Author: Gascuel Olivier
Lebre Sophie
Publication date: 01/01/2016

Overlapping genes exist in all domains of life and are much more abundant than expected at their first discovery in the late 1970s. Assuming that the reference gene is read in frame +0, an overlapping gene can be encoded in two reading frames in the sense strand, denoted by +1 and +2, and in three reading frames in the opposite strand, denoted by -0, -1 and -2. This motivated numerous researchers to study the constraints induced by the genetic code on the various overlapping frames, mostly based on information theory. Our focus in this paper is on the constraints induced on two overlapping genes in terms of amino acids, as well as polypeptides. We show that simple linear constraints bind the amino acid composition of two proteins encoded by overlapping genes. Novel constraints are revealed when polypeptides are considered, and not just single amino acids. For example, in double-coding sequences with an overlapping reading frame -2, each Tyrosine (denoted as Tyr or Y) in the overlapping frame overlaps a Tyrosine in the reference frame +0 (and reciprocally), whereas specific words (e.g. YY) never occur. We thus distinguish between null constraints (YY = 0 in frame -2) and non-null constraints (Y in frame +0 ⇔ Y in frame -2). Our equivalence-based constraints are symmetrical and thus enable the characterization of the joint composition of overlapping proteins. We describe several formal frameworks and a graph algorithm to characterize and compute these constraints. These results yield support for understanding the mechanisms and evolution of overlapping genes, and for developing novel overlapping gene detection methods.
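The Tyrosine example can be checked by brute force under one assumed reading of frame -2: the antisense codon covers the third base of the previous reference codon plus the first two bases of the current one, and is read as the reverse complement. That frame convention, the helper names and the use of the standard genetic code are assumptions of this sketch, not details taken from the paper.

```python
# Brute-force check of the Tyr constraints under an ASSUMED frame -2
# convention (antisense codon = reverse complement of prev_third + codon[:2]).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code
TYR_CODONS = {"TAT", "TAC"}

def antisense_codon(prev_third, codon):
    """Codon read on the opposite strand across prev_third + codon[:2]."""
    window = prev_third + codon[:2]
    return "".join(COMPLEMENT[base] for base in reversed(window))

def tyr_forces_tyr():
    """A reference Tyr leaves only Tyr (or a forbidden stop) on the other strand."""
    return all(
        antisense_codon(prev, codon) in TYR_CODONS | STOP_CODONS
        for prev in "ACGT"
        for codon in TYR_CODONS
    )

def yy_is_forbidden():
    """Two consecutive reference Tyr codons always create an antisense stop."""
    return all(
        antisense_codon(first[2], second) in STOP_CODONS
        for first in TYR_CODONS
        for second in TYR_CODONS
    )
```

Because a double-coding sequence cannot contain a stop codon in either frame, the first check corresponds to the non-null constraint and the second to the null constraint YY = 0.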

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Pasteur

A 'stochastic safety radius' for distance-based tree reconstruction

Author: Gascuel Olivier
Steel Mike
Publication date: 14/11/2014

A variety of algorithms have been proposed for reconstructing trees that show the evolutionary relationships between species by comparing differences in genetic data across present-day taxa. If the leaf-to-leaf distances in a tree can be accurately estimated, then it is possible to reconstruct this tree from these estimated distances, using polynomial-time methods such as the popular 'Neighbor-Joining' algorithm. There is a precise combinatorial condition under which distance-based methods are guaranteed to return a correct tree (in full or in part) based on the requirement that the input distances all lie within some 'safety radius' of the true distances. Here, we explore a stochastic analogue of this condition, and mathematically establish upper and lower bounds on this 'stochastic safety radius' for distance-based tree reconstruction methods. Using simulations, we show how this notion provides a new way to compare the performance of distance-based tree reconstruction methods. This may help explain why Neighbor-Joining performs so well, as its stochastic safety radius appears close to optimal (while its more classical safety radius is the same as many other less accurate methods).
Comment: 18 pages, 1 figure, 4 tables

arXiv.org e-Print Archive

CiteSeerX

Inferring evolutionary trees with strong combinatorial evidence

Author: Berry Vincent
Gascuel Olivier
Publication venue: University of Warwick. Department of Computer Science

We consider the problem of inferring the evolutionary tree of a set of n species. We propose a quartet reconstruction method which specifically produces trees whose edges have strong combinatorial evidence. Let Q be a set of resolved quartets defined on the studied species. The method computes the unique maximum subset Q* of Q which is equivalent to a tree and outputs the corresponding tree as an estimate of the species' phylogeny. We use a characterization of the subset Q* due to (Bandelt86) to provide an O(n^4) incremental algorithm for this variant of the NP-hard quartet consistency problem. Moreover, when choosing the resolution of the quartets by the Four-Point Method (FPM) and considering the Cavender-Farris model of evolution, we show that the convergence rate of the Q* method is at worst polynomial when the maximum evolutionary distance between two species is bounded. We complete these theoretical results with an experimental study on real and simulated data sets. The results show that (i) as expected, the strong combinatorial constraints it imposes on each edge lead the Q* method to propose very few incorrect edges; (ii) more surprisingly, the method infers trees with a relatively high degree of resolution.
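The quartet resolution step mentioned above can be illustrated with the usual statement of the Four-Point Method: among the three possible splits of four taxa, keep the one whose two within-pair distances have the smallest sum. The sketch below covers that step only, not the Q* algorithm; the function name and the frozenset-keyed distance table are assumptions made for this example.

```python
# Sketch of quartet resolution by the Four-Point Method (FPM).
# Only this step is shown; the Q* method itself is not implemented here.

def four_point_method(dist, a, b, c, d):
    """Return the split ((x, y), (z, w)) minimising dist(x, y) + dist(z, w)."""
    def pair_distance(x, y):
        return dist[frozenset((x, y))]

    candidates = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(
        candidates,
        key=lambda split: pair_distance(*split[0]) + pair_distance(*split[1]),
    )

# Additive distances for the tree ((A:1,B:2):1,(C:1,D:3)).
dist = {
    frozenset("AB"): 3, frozenset("AC"): 3, frozenset("AD"): 5,
    frozenset("BC"): 4, frozenset("BD"): 6, frozenset("CD"): 4,
}
split = four_point_method(dist, "A", "B", "C", "D")
```

With exact tree distances the two larger sums are equal and the smallest sum identifies the true split; with estimated distances the minimum is only an estimate of the quartet topology.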

Warwick Research Archives Portal Repository

Deep conservation of human protein tandem repeats within the eukaryotes

Author: Anisimova Maria
Gascuel Olivier
Schaper Elke
Publication venue: Oxford University Press
Publication date: 01/01/2014

Tandem repeats (TRs) are a major element of protein sequences in all domains of life. They are particularly abundant in mammals, where by conservative estimates one in three proteins contains a TR. High generation-scale duplication and deletion rates were reported for nucleic TR units. However, it is not known whether protein TR units can also be frequently lost or gained, providing a source of variation for rapid adaptation of protein function, or alternatively, tend to have conserved TR unit configurations over long evolutionary times. To obtain a systematic picture for protein TRs, we performed a proteome-wide analysis of the mode of evolution for human TRs. For this purpose, we propose a novel method for the detection of orthologous TRs based on circular profile hidden Markov models. For all detected TRs we reconstructed bi-species TR unit phylogenies across 61 eukaryotes ranging from human to yeast. Moreover, we performed additional analyses to correlate functional and structural annotations of human TRs with their mode of evolution. Surprisingly, we find that the vast majority of human TRs are ancient, with TR unit number and order preserved intact since distant speciation events. For example, ≥61% of all human TRs have been strongly conserved at least since the root of all mammals, approximately 300 Mya. Further, we find no human protein TR that shows evidence for strong recent duplications and deletions. The results are in contrast to high generation-scale mutability of nucleic TRs. Presumably, most protein TRs fold into stable and conserved structures that are indispensable for the function of the TR-containing protein. All of our data and results are available for download from http://www.atgc-montpellier.fr/TRE

Repository for Publications and Research Data

CiteSeerX

INRIA a CCSD electronic archive server

PubMed Central

ZHAW digitalcollection

Les espaces de l'halieutique

Author: Fonteneau Alain
Gascuel D.
Maury Olivier
Publication venue: 'The Japanese Bird Banding Association'
Publication date: 01/01/2000

This article presents a spatially explicit, environmentally forced model of the Atlantic yellowfin tuna population. The model relies on nonlinear relationships, estimated by generalized additive modelling (GAM), that characterize the environmental preferences of yellowfin tuna on the one hand and their catchability by different fishing gears on the other. Expressed analytically, the relationships describing the environmental preferences of yellowfin are used to force an advection-diffusion-reaction model of the yellowfin population. The relationships describing catchability by different gears, also expressed analytically, make it possible to consider fitting the model to observed catches. The model simulates the distribution of the animals as a function of the oceanic environment and of actual catches. Through various simulations, we examine the phenomenon of local overexploitation of adult tuna in the Gulf of Guinea. The very large magnitude of the phenomenon observed in the simulations is discussed. (Author's abstract)

Horizon / Pleins textes

Rapidly Computing the Phylogenetic Transfer Index

Author: Gascuel Olivier
Swenson Krister M.
Truszkowski Jakub
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)
Publication date: 01/01/2019

Given trees T and T_o on the same taxon set, the transfer index phi(b,T_o) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T_o. Recently, Lemoine et al. [Lemoine et al., 2018] used the transfer index to design a novel bootstrap analysis technique that improves on Felsenstein's bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b in T in O(n log^3 n) time, which improves upon the current O(n^2)-time algorithm by Lin, Rajan and Moret [Lin et al., 2012]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).
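The definition of the transfer index can be made concrete with a naive computation that compares one bipartition of T against every bipartition supplied for T_o. This is a brute-force illustration of the definition, not the O(n log^3 n) algorithm of the paper; representing a bipartition by one of its sides as a frozenset, and the function names, are assumptions of this sketch.

```python
# Naive transfer index from explicit bipartitions (illustration of the
# definition only; not the fast algorithm described in the abstract).

def transfer_distance(side, other_side, n_taxa):
    """Fewest taxa to ignore so that two bipartitions become equal."""
    mismatches = len(side ^ other_side)        # taxa on "different" sides
    return min(mismatches, n_taxa - mismatches)  # either side may be matched

def transfer_index(branch_side, reference_sides, taxa):
    """Minimum transfer distance from one branch of T to the given splits of T_o."""
    n_taxa = len(taxa)
    return min(
        transfer_distance(branch_side, side, n_taxa) for side in reference_sides
    )

taxa = frozenset("ABCDEF")
branch = frozenset("ABC")                      # bipartition ABC | DEF in T
reference = [frozenset("AB"), frozenset("ABD"), frozenset("EF")]  # splits of T_o
phi = transfer_index(branch, reference, taxa)
```

In this example ignoring the single taxon C makes ABC | DEF match AB | CDEF, so the index is 1. Each call scans all supplied bipartitions, which is what makes the naive approach slow on large trees.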

HAL Descartes

Dagstuhl Research Online Publication Server

HAL-Pasteur